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Abstract 

The geometric multigrid method (GMG) is one of the most efficient solving techniques for 
discrete algebraic systems arising from many types of partial differential equations. GMG uti- 
lizes a hierarchy of grids or discretizations and reduces the error at a number of frequencies 
simultaneously. Graphics processing units (GPUs) have recently burst onto the scientific com- 
puting scene as a technology that has yielded substantial performance and energy-efficiency 
improvements. A central challenge in implementing GMG on GPUs, though, is that computa- 
tional work on coarse levels cannot fully utilize the capacity of a GPU. In this work, we perform 
numerical studies of GMG on CPU-GPU heterogeneous computers. Furthermore, we compare 
our implementation with an efficient CPU implementation of GMG and with the most popular 
fast Poisson solver, Fast Fourier Transform, in the cuFFT library developed by NVIDIA. 
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1 Introduction 



Simulation-based scientific discovery and e ngine ering design demand extreme c omput ing power and 
high-effici ency algorithms (Bj0rstad et al. 1992 : Biorstad. Dryja, and Rahman 20031 : Hey, Tansley, 
and Tolle |2009j; Kaushik et al. 1201 ll : Keyes |201l|) . This demand is one of the main forces driving 
the pursuit of extreme-scale computer hardware and software during the last few decades. It has 
become increasingly important for algorithms to be well-suited to the emerging hardware architec- 
ture. In fact, the co-design of architectures, algorithms, and appellations is particulary important 
given that researchers are trying to achieve exascale (10 18 floating-point operations per second) 
computing. Although the question of what is the best computer architecture to achieve exascale or 
higher remains highly debatable, many researchers agree that hybrid architectures make increasing 
sense due to modern-day energy-consumption constraints, whereby we can no longer reduce voltage 
proportional to the density of transistors. There are already quite a few heterogeneous comput- 
ing architectures available, including the Cell Broadband Engine Architecture (CBEA), Graphics 
Proc essing Units (GP Us), a nd Fiel d Pro grammable Gate Arrays (FPGAs) (Carpenter and Symon 
2009! : Brodtkorb et al. l20ld : Wolfe |20lJ). 

A GPU is a symmetric multicore processor that can be accessed and controlled by CPUs. 
The Intel/ AMD CPU accelerated by NVIDIA/ AMD GPUs is probably the most commonly used 
heterogeneous high-performance computing (HPC) architecture at present. GPU-accelerat ed su - 
percomputers feature in man y of t he top computing systems in the HPC Top500 (Top500 l201lh 
and the Green500 (Green500 l201ll ). It has been reported that many "old" supercomputers, such 
as Jaguar, are currently being redesigned in order to incorporate GPUs and thereby achieve better 
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performance. GPUs have evolved from fixed-pipeline application-specific integrated circuits into 
highly programmable, versatile computing devices. Under conditions often met in scientific com- 
puting, modern GPUs surpass CPUs in computational power, data throughput, and c omputational 
efficie ncy p er watt by almost one order of mag nitude (Buck 120071 : Chen et al. bOQfll : Nickolls and 
Dally E3). 

Not only are GPUs the key ingredient in many current and forthcoming petaflop supercom- 
puters, they also provide an affordable desktop supercomputing environment for everyday usage, 
with peak computational performance matching that of the most powerful supercomputers of only 
a decade ago. General-purpose graphics processing units (GPGPU), as a high-performance compu- 
tational device are becoming increasingly popular. Today's NVIDIA Fermi GPU and the upcoming 
(expected in December 2012) Intel Many Integrated Core (MIC) architecture are the most promis- 
ing co-processors with high energy-efficiency and computation power. The Intel Knight Corner MIC 
(50 cores) is capable of delivering I Teraflop operation in double precision per second, whereas the 
peak performance of Tesla 2090 is 665 Gigaflop operations in double precision per second. On the 
other hand, as GPU has a high-volume graphics market, it is expected by many experts to have a 
price advantage over MIC, at least immediately after its launch. 

Probably one of the most discussed features of the MIC architecture is that it shares the 
x86 instruction set such that users often assume that they do not need to change their existing 
codebase in order to migrate to MIC. However, this assumption is subject to argument as even if 
legacy code can easily be migrated, whether the application it is then used for is able to achieve the 
desired performance is questionable. Achieving scalable scientific applications in the exascale era 
is our ultimate goal. Hence, software, more importantly algorithms, must adapt if it is to unleash 
the power of the hardware. Unfortunately, none of the processors envisioned at present will relieve 
today's programmers from the hard work of preparing their applications. In fact, power constraints 
will actually cause us to use simpler processors at lower clock rates for the majority of our work. 
As an inevitable consequence, improvements in terms of performance will largely arise from more 
parallel algorithms and implementation. 

It is still too early to tell which architecture(s) will dominate the supercomputing market in 
the future. In all likelihood, different applications will benefit from any given architecture in 
specific ways. Thus a one-size-fits-all solution will almost certainly not arise. In many numerical 
simulation applications, the most time-consuming aspect is usually the solution of large linear 
systems of equations. Often, as they are generated by discretized partial differential equations 
(PDEs), the corresponding coefficient matrices are very sparse. The Laplace operator (or Laplacian) 
occurs in many PDEs that describe physical phenomena such as heat diffusion, wave propagation 
and electrostatics, and gravitational potential. For many efficient methods for solving discrete 
problems arising from PDEs, a fast Poisson solver is a key ingredient in achieving a high level 
of efficiency (Xu l20ld ). Numerical schemes based on fast Poisson solvers have been successfully 
applied to many practical problems among whic h are computer tomography, pow er grid analysis, 
and q uantum-chemical sim ulatio n (Kostler et al. 2007 : Sturmer, Kostler, and Rude 20081 : Shi et al. 



20091 : Yang, Cai, and Zhou [2011 



Because of its plausible linear complexity — i.e., the low computational cost of solving a linear 
system with N unknowns is O(N)- — the Poisson solver is one of th e mo st popular GMG meth- 
ods fHackbu schll98.4 Bramble 1993; Briggs, Henson, and McCormickboOO; Trottenberg, Oosterlee, 

). Although the GMG's applicability is limited as it requires ex- 



and Schiiller 2001 



Brandt 2011 



plicit information on the hierarchy of the discrete system, when it can be applied, GMG is far more 
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efficient t han it s algebr aic ve rsion, the algebraic niultigrid (AMG) method (Brandt, McCo rmick , 
and Ruge ll98^; Brandt [l98^; Ruge and Stuben [l987|; Trottenberg, Oosterlee, and Schiiller EoQll). 
Another p opular choice is the direct solver based on the fast Fourier transform or the FFT (Cooley 
and Tukeyll965|) on tensor product grids. The computational cost of the FFT-based fast Poisson 
solver is 0(N log N), and FF T can easily be called from highly optimized software libraries, such 
as FFTW (Frigo and Johnson 120051 ) and the Intel Math Kernel Library (MKL) . These adva ntages 
make FFT an extremely appealing method (Sturmer, Kostler, and Rude 20081 ; Lord et al. [2008) 
when it is applicable. 

It is well-known that heterogeneous architectures pose ne w pr ogramming difficulties compared 
to existing serial and parallel platforms (Chamberlain et al. 2007 : Brodtkorb et al. 2010l ). In this 
paper, we investigate the performance of fast Poisson solution algorithms, more specifically, GMG 
and FFT, on modern massively parallel computing environments like Tesla GPUs. Considerable 
effort has been devoted to developing efficient s olvers for linear systems arisin g from PDEs and 
other applications (Bo lz, Fa rmer, and Grinspun 20031 ; Bell an d Gar land 2008 : Bell and Garland 
20091 : Barrachina et al. |2009j; Jeschke and Cline|2009j; Cao etaL 2010|; Elble, Sahi nidis. and Vouzis 
20ld : Georgescu and Okuda 2010 ; Bell, Dalton, and Olson 201 ll ; Heuveline et al. 2011 ; Heuveline, 
Lukarski, and Weiss l201ll ; Knibbe, Oosterlee, and Vuik l201ll ) and the references therein. 

The main purpose of this paper is to consider the following important questions, all of which 
are central to understanding geometric multigrid methods on GPU architecture: 

• Is it possible to achieve a satisfactory speedup on GPUs for multigrid algorithms? 

• How does the performance of multigrid algorithms on GPUs compare with their performance 
on a single state-of-the-art CPU core? 

• How much of the computational power of GPUs can be used for multigrid algorithms? How 
cost-effective are CPU-GPU systems? 

• Compared with the optimized implementation of direct solvers based on FFT, does GMG 
have any advantages in addition to its generality? 

We will consider answers to these questions based on carefully designed numerical experiments 
described herein. 

The rest of the paper is organized as follows: In Section [21 we introduce the preliminary fea- 
tures of the hardware and algorithms under investigation. In Section [31 we give details about our 
implementation of GMG in a CPU-GPU heterogeneous computing environment. In Section [4j 
we analyze the complexity of the GMG algorithm. We report our numerical tests and analysis in 
Section [5l We then summarize the paper with some concluding remarks in Section El 



2 Preliminaries 

2.1 A Brief Glance at GPU and CUD A 

Graphics processing units (GPUs) recently burst onto the scientific computing scene as an innova- 
tive technology that has demonstrated substantial performance and energy-efficiency improvements 
for many scientific applications. A typical CPU-GPU heterogenous architecture contains one or 
more CPUs (host) and a GPU (device). GPU has its own device memory, which is connected to 
the host via a PCI express bus. One of the main drawbacks of using such an architecture for PDE 
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applications is that it is necess ary to exchange data between the host and the device frequently 
(see Figure [I]) (Brodtkorb et al. l2O10l ). 
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Figure 1: Schem atic o f heterogenous architecture: a quad core CPU in combination with a GPU 
(Brodtkorb et al.[2O10|). Data must be moved to the GPU memory, and parallel kernels are launched 
asynchronously on the GPU by the host. 

In regard to sparse matrix operations, the memory bandwidth is usually where the bottleneck 
occurs. Furthermore, the gap between the speed of float ing-point operations and the speed for 
accessing memory grows every year (Asanovic et al. 20061 ). In this sense, we do not expect that 
the iterative linear solvers, which is usually memory-bounded, to readily derive benefits easily from 
increasing the number of cores. One way to address this problem is to use a high-bandwidth 
memory, such as Convey's Scatter-Gather memory. Another is to add multithreading, where the 
execution unit saves the state of two or more threads, and can swap execution between threads in 
a single cycle. There are two ways to do this — either by swapping between threads at a cache miss 
or by alternating between threads on every cycle. While one thread is waiting for memory, the 
execution unit keeps busy by switching to a different thread. To be effective, though, the program 
must become even more parallel, which will be exploited via multithreading. 

CUDA (Compute Unified Device Architecture) (NVIDIA l2012al ) is a parallel computing plat- 
form and programming model invented by NVIDIA. It delivers dramatic increases in computing 
performance by harnessing the power of the graphics processing unit (GPU). NVIDIA provides a 
complete toolkit for programming on the CUDA architecture, supporting standard computing lan- 
guages such as C/C++ and Fortran. CUDA C and Fo rtran are the most widely used programming 
languages for GPU programming today (Wolfe l20ri). CUDA was developed simultaneously with 
the GeForce 8 architecture (NVIDIA's internal code name for the latter is Tesla), and publicly 
announced in 2006. In addition to C UDA, other options are O penCL, AMD Stream SD K, an d 
OpenACC supported by CAPS (CAPS[20jJ), CRAY (Inc. |20jJ), NVIDIA, and PGI (PGl|20Tj)- 



2.2 The Poisson Equation and Its Finite Difference Discretizations 

Consider the Poisson equation 
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where = (0, l) d C 1 

the Poisson equation (12. ip (Morton and Mayers 20051 ). In other words, the Laplace operator is 



The standard central finite difference method is applied to discretize 
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discretized by the classical second-order central difference scheme. After discretization, we end up 
with a system of linear equations: 

Au = f. (2.2) 

2D Five-Point Stencil 

We use the five-point central difference scheme in 2D. Consider a uniform square mesh of Cl = [0, l] 2 
with size h = — and in which X{ = ih, yj = jh(i,j = 0,1,..., n). Let Uij be the numerical 
approximation of u(xi,yj). The five-point central difference scheme for solving (|2.ip in 2D can be 
written as follows: 

* = f(xi,yj) t,j = l,2,...,ra-l. 



/i 2 

3D Seven-Point Stencil 

We use the seven-point central difference scheme in 3D. Similar to the 2D case, we consider a 
uniform cube mesh of Q = [0, l] 3 with size h = - and in which x\ = ih, yj = jh and Zk = 
kh, i,j,k = 0,1,..., n. Let Mijfc ~ u(xi,yj, Zk) be the approximate solution. The seven-point 
central difference scheme for solving (|2.ip in 3D reads 



~ u i-l,j,k ~ u i,j-l,k ~ u i,j,k-l + ® u i,j,k ~ u i+l,j,k ~ u i,j+l,k ~ u i,j,k+l _ x 

J^2 — J y x i^yji z k) t 

for all i, j, k = 1, 2, . . . , n — 1. 

2.3 Fast Fourier Transform 

A fast Fourier transform (FFT) is an efficient algorithm for computing the discrete Fourier trans- 
form (DFT) and its inverse. DFT decomposes a sequence of values into components of different 
frequencies. Computing DFT directly from its definition is usually too slow to be practical. The 
FFT can be used to compute the same result, but much more quickly. In fact, computing a DFT 
of N points directly, according to its definition, takes 0(N 2 ) ari thmet ical operations, whereas FFT 
can compute the same result in 0{N log N) operations (Walker [l99^) . 



On tensor product grids, FFT can be used to solve the Poisson equation efficiently. We now 
explain the key steps for using FFT to solve the 2D Poisson equation (the 3D case is similar): 

1. Apply 2D forward FFT to f(x, y) to obtain f(k x , k y ), where k x and k y are the wave numbers. 
The 2D Poisson equation in the Fourier space can then be written as 

- Au(x, y) = f(x, y) FFT ) - (k 2 x + k 2 y )u(k x , k y ) = f(k x , k y ). (2.1) 

2. Apply the inverse of the Laplace operator to f(k x , k y ) to obtain u(k x , k y ), which is the element- 
wise division in the Fourier space 

rfh u \ — f(k x ' by) 

U\K X , K y ) — 



-1- 

x ~ ""y 



3. Apply 2D inverse FFT to u(k x ,k y ) to obtain u(x,y). 
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The NVIDIA CUDA Fast Fourier Transform (cuFFT version 4.1) library provides a simple 
interface for computing FFTs up to 10 times faster than MKL 10.2.3 for single precisionQ By 
using hundreds of processor cores on NVIDIA GPUs, cuFFT is able to deliver the floating point 
performance o f a GP U without necessitating the development of custom GPU FFT implementa- 
tion (NVIDIA US]). 



3 Geometric Multigrid Method for GPU 

Multigrid (MG) methods in numerical analysis comprise a group of algorithms for solving differen- 
tial equations using a hierarchy of discretizations. The main idea driving multigrid methodology is 
that of accelerating the convergence of a simple (but usually slow) iterative method by global cor- 
rection from time to time, accomplished by solving corresponding coarse-level problems. Multigrid 
meth ods ar e typically applied to numerically solvi ng dis cretized partial differential equations (Hack- 



busch ll985l ; Trottenberg, Oosterlee, and S chillier l200ll ). In this section, we briefly review standard 



multigrid and full multigrid (V-cycle) algorithms and their respective implementations in a CPU 
GPU heterogenous computing environment. 



3.1 Geometric multigrid method 

The key steps in the multigrid method (see Figure [2]) are as follows: 

• Relaxation or Smoothing: Reduce high-frequency errors using one or more smoothing 
steps based on a simple iterative method, like Jacobi or Gauss-Seidel. 

• Restriction: Restrict the residual on a finer grid to a coarser grid. 

• Prolongation or Interpolation: Represent the correction computed on a coarser grid to a 
finer grid. 
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Figure 2: Pictorial representation of a multigrid method with three grid levels. 

One of the simplest multilevel iterative methods is the multigrid V-cycle (see Figure [3]) . The 
algorithm proceeds from left to right and from top (finest grid) to bottom (coarsest grid) and back 
up again. The V-cycle algorithm can be written as (shown in Figure 3) 



*cuFFT 4.1 on Telsa M2090, ECC on, MKL 10.2.3, and TYAN FT72-B7015 Xeon x5680 Six-Core 3.33GHz. 
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Figure 3: A schematic description of the V-cycle. 



Algorithm 1: Multigrid V-cycle u = MG-V(/z/, L, u, f ) 

1 for I = to L - 2 do 

2 Rcla,xf orward ( iM f ,Ai, fi,ui ) 

3 _ n = fi - Aiui; f l+1 = R\ +1 fi 

4 Relaxy or „ ard ( fif, A£_i, U£,_i ) 

5 for i = L — 1 to do 

6 P' + 1 M; + 1 

7 R.elax5 Bc fei„ ar d( Mi>i Aj, iti ) 



Remark 3.1 (Coarsest-level solver) Note that, for simplicity, we assume that the coarsest level 
L—l contains one degree of freedom. Hence, Relax forwardi fJ-f, A-L-ij /i-i> ^L-i ) hi Algorithm Q] 
solves the coarsest-level problem exactly. The same thing happens in Algorithm [2l 

The full multigrid (FMG) usually gives the best performance in terms of computational com- 
plexity. The idea of FMG is represented in Figure UJ We start from the coarsest grid and solve 
the discrete problem on the coarsest grid. Then, we interpolate this solution to the second-coarsest 
grid and perform one V-cycle. These two steps are repeated recursively on finer and finer grids, 
until the finest grid possible is achieved. The details are described in Algorithm [2l 

Fine 

Relaxation 

O Exact solving 

^ Restriction 

/ Prolongation 

Coarse \f \$ \f \f / FMG Prolongation 

Figure 4: A schematic description of the full multigrid algorithm. The algorithm proceeds from 
left to right and from top (finest grid) to bottom (coarsest grid). 




S 



Algorithm 2: Full Multigrid V-cycle u = FMG-V(/i/,^, L, u, f ) 



1 init fi,ui, I = 0, . . . , L — 1 

2 RelaxGS /ortU£lrd ( M/,A L _ 1 ,/ i „i,u L _ 1 ) 

3 for { = L - 2 to do 

4 u ; = P; +1 w i+ i 

5 L «l =y-cyc^e(fi f ,fi b ,l,ui,fi ) 



3.2 Implementation of GMG on a CPU— GPU machine 
Algorithm 3: u,iter = GMGSolve(c?, L, Lg, tols, maxit, /if, fib) 

1 for I = to Lg do: init «; and resinit = ||/o — Auo|| 

2 for £ = Lg to L — 1 do: init C; and /; ; res = resinit; iter = 

3 while (res > tols X resinit)and(iter < maxit) do 

4 for Z = to L fl - 1 at GPU do 

5 RelaxGS /oru , ard ( fif t Ai,fi,Si ); n=fi- AjMj; = Rj +1 r ; 

6 copy /f, from DEVICE memory to HOST memory 

T n Le = MG-y(tif,fi b ,Lg,u Le ,f Lg ) 

8 copy uld from HOST memory to DEVICE memory 

9 for Z = Lg - 1 to at GPU do 

10 ui = ui + P\ +1 ui +1 \ Rela^GS backward ( /j, b ,Ai, f h ui ) 

11 res = ||/ — AuoH 

12 iter = iter + 1 

13 copy flo from DEVICE memory to HOST memory u 



Remark 3.2 We offer some remarks about our implementation: 

1. There have been many discussions on how to implement geometric multigrid methods effi- 



ciently on modern computer architectures; see, for example, Weiss 12001 



2. We use a four-color and an eight-color Gauss-Seidel smoother for RelaxGS in 2D and 3D, 
respectively. As we are considering structured grids and the coloring is easy to obtain, we 
prefer the GS smoother over the weighted Jacobi smoother. A weighted Jacobi smoother is 
likely to achieve a higher peak performance and higher speedup over the corresponding CPU 
version; however, it usually requires more iterations and wall time compared with the colored 
GS smoother if both methods use same multilevel iteration, like V-cycle. 

3. The gray boxes represent the code segments running on GPU (kernel functions). When 
Lg = 0, Algorithm [3] runs on CPU completely, and when Lg = L, Algorithm [3] runs on GPU 
solely. However, when < Lg < L, these functions run in CPU-GPU hybrid mode. 

4. Graphics processors provide texture memory to accelerate frequently performed operations. 
As optimized data access is crucial to GPU performance, the use of texture memory can 
sometimes provide a considerable performance boost. We band the vectors ui as one dimension 
texture memory in function 77 = // — A^, I = 0, 1, . . . , L — 2. 
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4 Complexity Analysis of the GMG Algorithm 



For the geometry multigrid of the finite difference method on structured meshes, it is not necessary 
to explicitly assemble the global stiffness matrix A (i.e., matrix- free) . A subroutine for the matrix- 
vector multiplication of the corresponding finite difference operator is called whenever we need to 
compute r = f — An. This subroutine can be implemented directly from the central difference 
scheme. 

One V-cycle of the GMG algorithm consists of computing the residual, forward relaxation, 
backward relaxation, restriction, prolongation, and the inner product. Take the 2D case as an 
example: we can analyze the time and space complexity of these operations. The time complexity 
of one V-cycle can be proven to be O(N) for b oth th e 2D and 3D cases. The optimal complexity 
of GMG has been analyzed by Griebel (Griebel[L989|). 



Next, we analyze the exact operation counts for a V-cycle. 

2D case 

• Residual: fj = fi — A;-u; 

r ij = fli ~ Hi + u i±i,j + 4,j±i, U = 1> 2, • • • , n t - 1, (4.1) 

where u l i±1 • = u[_ 1 j + u\ +l j and in this case when i — 1 = or i + 1 = ni + 1, then u\ ±l j = 0. 
From now on, we will use the notation ± in this section. The equation (I4.ip requires 6 
floating-point operations or work units (W) per unknown in the 2D case. Furthermore, we 
obtain the total number of floating-point operations for the residual in one V-cycle as 

L-2 

W Residual = Q^ n i) 2 - ( 4 - 2 ) 

1=0 

• Gauss-Seidel Relaxation: 

u \,3 = ^Ul,j + A±i,j + u ij±i)^ i,j = l,2,...,ni-l. (4.3) 

Equation (j4.3|) shows that there are 5 floating-point operations per unknown in the 2D case. 
Hence, the total number of floating-point operations in the forward and backward Gauss- 
Seidel relaxation in one V-cycle is 

L-l L-2 
WcSforward = 5 ^(n Z ) 2 , W G Sbackward = 5 ^(n Z ) 2 . (4.4) 

1=0 1=0 

• Restriction: r/ +1 = R| +1 fj 

= s(2 r 2i,2j + r 2i±l,2j + r 2i,2j'±l + r 2i-l,2j-l + r 2i+l,2j+l) j hj = 1, 2, . . . , 72; — 1. (4.5) 
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Equation (|4.5p requires 8 floating-point operations per unknown in the 2D case. Furthermore, 
we obtain the total floating-point operations of restriction for one V-cycle as 

L-l 

WResitricUon = 8 ^(nj) 2 . (4.6) 
1=1 

• Prolongation: e\ = ej + Pj +1 e*; + i 



'2i,2j - 

J 

-2i.2j+l 



- J 



+ e 



l+l 

i,j ■ 



+ 



2 v i,j 



-2i+l,2j 
J 



2i-\ 



l,2j+l 
= 1,.. 



3 2i+l,2j + 2 ' 
_ e 2i+l,2j+l 
, »Z+1 - 1 



,'+1 



+ 



(eg 1 + J+1 



(4.7) 



Equation (|4.7p shows that there are 3x ^ +1 floating-point operations per unknown for the 2D 
case. Furthermore, we can obtain the total number of floating-point operations of prolongation 
for one V-cycle as 



L-2 



^ Prolongation = 2.5 ^(n;) 2 . (4. 
1=0 

Compute the norm of the residual: || • || 



FoIIl 2 



(no) 2 



(4.9) 



From equation (|4.9p . we can see that the total number of floating-point operations for com- 
puting £ 2 -norm is 2(no) 2 . 

By combining equations (|4.2|) . (|4.4|) . (|4.6|) . (|4.8|) . and (|4.9|) . we can obtain the total number of 
floating-point operations per unknown for the 2D case in one V-cycle as 

6E("l) 2 + 5 EW + 5 EW + 8 EW + f eW 2 + (2 + 6)(n ) 2 

«=0 «=0 1=0 1=1 1=0 or 



This means the total number of floating-point operations per unknown required by one V-cycle in 
the 2D case is about 36. 
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Figure 5: 2D stencil for restriction (left) and prolongation (right). Blue circle: fine-grid points; red 
cross: coarse-grid points. 

3D case 

Similarly, we can count the complexity of a V-cycle in 3D: 
• Residual: 77 = // — Afii 



fi,j,k ~ ® u i,j,k + u i±l,j,k + u i,j±l,k + u i,j,k±l> i,j,k — 1, . . . ,ni - 1. (4.10) 



L-2 



^esid„a/ = 8 (n,) 3 + 8(n ) 3 . 



1=0 



Gauss-Seidel Relaxation: 



u i,j,k 



1 



^(fUk + A±xj,k + u ij±i,k + u i,j,k±i) i,j,k = l,...,ni-l. (4.12) 



(4.11) 



L-l 



L-2 



1=0 



1=0 



Was forward = 7^(nj) 3 , WcSbackward = 7^J(n;)" 

• Restriction: rj+j = R.J +1 fj 



(4.13) 



1 



'"• ' "^^2^2^,2*; + r 2i±l,2j,2k + r 2i,2j±l,2fc + r 2i,2j,2fc±l + r 2i,2j-l,2k-l 



r /+l ^ 



~ r 2i,2j+l,2k+l + r 2i~l,2j,2fc-l + r 2i+l,2j,2/c+l + r 2i-.l .2./- I..2A- 
" r 2i- J,2j - !,2fe + r 2i-l,2j-l,2fc-l + r 2j+l 2 , + .2/.-+ : I ■ 

for i,j,k = l,...,ni-l. 



Then we obtain that 



L-l 



esitriction 



16£(n,) E 



(4.14) 



(4.15) 
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• Prolongation: ei = e\ + P ; ej+i 



23 L ~ 2 

^Prolongation = X^) 3 " ( 4 - 16 ) 



1=0 

Compute the norm of the residual: || • || 



Fo||l2 



(no) 3 

£ 

.7 = 1 



(4.17) 



By combining (|4.1ip . (|4.13p . Q4.15j) . (|4.16[) . and (|4.17p . we obtain the total number of floating- 
point operations per unknown for the 3D case in one V-cycle as 



8 ZV) 3 + 7 Z(nif + 7 eW + 16 EV/) 3 + f EM 3 + (2 + 8)(n ) J 
i=o 2=0 «=o |=i i=o 

(no) 3 



41. 



Hence, the total number of floating-point operations per unknown for one V-cycle is 36 and 41 for 
the 2D and 3D cases, respectively. 

Remark 4.1 (Space complexity of V-cycle GMG) In GMG Algorithm O we need only keep 
tii, fi (I = 0, 1, . . . , L — 1) and rj (/ = 0, 1, . . . , L — 2) in the host or device memory. Therefore, we 
obtain the memory space complexity of GMG Algorithm [3] as follows: 

Memory /N = — — '~° ^ 4. (4.18) 

Equation (|4.18p shows that the memory space complexity of GMG Algorithm [3] has about 4 times 
as many unknowns for both 2D and 3D. Equation (|4.18p also shows that the space complexity of 
the GMG V-cycle is O(N). 



5 Numerical Experiment 

In this section, we perform several numerical experiments and analyze the performance of GMG as 
proposed in Algorithm [3l 



5.1 Environment for Comparisons 

Our focal environment is a low-cost commodity-level NVIDIA GPU together with a HP computing 
workstation. Details in regard to the machine are set out in Table HJ From this table, we notice 
that the initial and energy-consumption costs of the particular GPU we use is roughly 2 times of 
the CPU costs. 
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Table 1: Experiment Environment 



CPU Type 


AMD FX-8150 8-core 


CPU Clock 


3.6 GHz x 8 cores 


CPU Energy Consumption 


85 Watts (idle) ~ 262 Watts (peak) 


CPU Price 


300 US Dollars 


Host Memory Size 


16GB 


GPU Type 


NVIDIA GeForce GTX 480sp 


UrFU Clock 


1.4 (jt±z x 15 x 32 cores 


GPU Energy Consumption 


141 Watts (idle) - 440 Watts (peak) 


GPU Price 


485 US Dollars 


Device Memory Size 


1.5GB 


Operating System 


CentOS 6.2 


CUDA Driver 


CUDA 4.1 


Host Compiler 


gcc 4.4.6 


Device Compiler 


nvcc 4.1 



For our numerical experiments, we chose to use AMD FX(tm)-8150 Eight-Core 3.6GHz CPU (its 
peak performance in double precision is 1.78GFLOPs3) and the NVIDIA GeForce GTX 480 GPU. 
GTX 480 supports CUDA and it is composed of 15 multiprocessors, each of which has 32 cores 
(480 cores in total) . Each multiprocessor is equipped with 48KB of very fast shared memory, which 
stores both data and instructions. All the multiprocessors are connected to the global memory, 
which is understood as an SMP architecture. The global memory is limited to a maximum size of 
1.5GB. However, there is also a read-only cache memory called a texture cache, which is bound to a 
part of the global memory when a code is initiated by the multiprocessors. The main performance 
parameters of GeForce GTX 480 is described in Table [2fl 

Table 2: Theoretical peak performance of NVIDIA GeForce GTX 480 



Single-precision performance [GFLOPs] 1300.00 

Double-precision performance [GFLOPs] 177.00 

Theoretical memory bandwidth [GB/s] 177.00 

Device to device memory bandwidth [GB/s] 148.39 

Device to host memory bandwidth [GB/s] 4.46 

Host to host memory bandwidth [GB/s] 9.44 

Host to device memory bandwidth [GB/s] 3.92 



Remark 5.1 (Multicore effect) We note that, in all our numerical tests on CPUs, we use only a 
single CPU core in order to provide an unbiased benchmark on CPUs. The reason is that we want 
to eliminate the effect of different multi-threaded implementations of GMG on multicore CPUs. It 
is fair to say that on AMD FX8150 (8-core) the speedup of eight-thread GMG (over one thread 
version) is, in general, 2.0 to 4.0, depending on the algorithm and implementation. 

For test purposes, our focus is the simple model problem (|2.ip in the 2D and 3D cases. To be 
more specific, we consider the following two cases: 

^Obtained experimentally using LINPACK (http://www.netlib.org/benchmark/linpackc). 

^Numbers in the last four rows are obtained experimentally using the bandwidthtest of CUDA 4.1 SDK. 
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Example 5.1 For the model problem \2.1\ let 



n = (o,i) 2 c m 2 , 

f(x, y) = sin(vrx) sin(vry), (x, y) G fi. 
Example 5.2 For the model problem \2.1[ let 

n = (o,i) 3 c m 3 , 

/(x, y, z) = sin(-7rx) sin(7ry) sin(7rz), (x, y, z) € Vt. 
The tolerance for the convergence is tol = 10~ 6 and (j,f = ^ = 1. 

5.2 CPU v.s. GPU 

First, we test the number of iterations and the discretization error of the finite difference schemes 
on CPUs and GPUs. Tables [3] and H] show that both the GPU version and the CPU version of 
GMG achieve the optimal discretization error, 0{h 2 ), in R 2 and M 3 . Furthermore, the GPU and 
the CPU versions take the same number of iterations as each other to reach the given convergence 
tolerance, i.e., the GPU version is equivalent to the corresponding serial version. 



Table 3: The iteration numbers, discretization errors, and convergence rates of V-cycle in 2D 



N 



(2 B +T) 2 
(2 9 + l) 2 
(2 10 + l) 2 
(2 11 + l) 2 
(2 12 + l) 2 



CPU 



#It 



u 



Uh 



u-u 2h 



\u-u h I 



11 6.250e-6 

11 1.565e-6 3.99 

11 3.910e-7 4.00 

11 9.719e-8 4.02 

11 2.370e-8 4.10 



GPU 



#It 



Uh\ 



11 
11 
11 
11 
11 



6.250e-6 
1.565e-6 
3.910e-7 
9.719e-8 
2.370e-8 



u~u 2h 



\u-u h \ 



3.99 
4.00 
4.02 
4.10 



Table 4: The iteration numbers, discretization errors, and convergence rates of V-cycle in 3D 



N 



(2 6 + r 

(2 7 + 1; 
(2 8 + 1; 



CPU 



#It 



\u - u h \ 



U-U 2 h 



W-u h 



15 2.713e-4 

15 6.936e-5 

15 1.753e-5 

15 4.404e-6 



3.91 
3.97 
3.98 



GPU 



#It 



u-u h 



1 1 


U-U 2 h 




\u-u h \ 



15 2.713e-4 

15 6.936e-5 3.91 

15 1.753e-5 3.97 

15 4.404e-6 3.98 
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Figure 6: Speedup (left) of GMG on GPU compared with its CPU version and performance (right) 
of GMG on GPU in 2D 

Second, we compare the computing time for the GPU version and the CPU version of GMG. 
For the 2D test problem, in the best-case scenario, the GPU version of GMG can achieve 18.49 
times speedup, which is 40% of the SpMV operation on the finest grid (see Tableland Figured]). 
The extent of speedup indicates that when L < 10, the maximum speedup is archived at Lg = 3 or 
4; however, when L > 10, the maximum speedup is archived at Lg = L (all run on GPU). Moreover, 
the speedup increases as L increases. In this case, we can obtain 15.17 GFLOPs in double precision, 
which is 8.6% of the theoretical peak performance of GTX 480 (see Figured]). 

Table 5: Wall time, GFLOPs (double precision), and speedup for computing the residual in 2D 



L 


CPU 


GPU 


Speedup 


time (s) 


GFLOPs 


time (s) 


GFLOPs 


8 


1.5440e-4 


2.57 


1.6148e-5 


24.54 


9.56 


9 


1.7865e-3 


0.88 


4.9640e-5 


31.81 


35.99 


10 


7.8907e-3 


0.80 


1.8124e-4 


34.78 


43.54 


11 


3.2138e-2 


0.78 


7.0802e-4 


35.58 


45.39 


12 


1.3021e-l 


0.80 


2.8167e-3 


35.75 


46.23 



Similarly, for the 3D test problem, in the best-case scenario, we can achieve 15.99 times speedup, 
which is 61% of the SpMV operation on the finest grid (see Figure [7] and Table EJ. Similar to the 
2D case above, if L < 8, then Lg = 2 or 3 gives the best speedup. However, if the size of the 
problem is large enough, then we should run it completely on GPU. In this case, 13.59 GFLOPs 
double-precision operations are performed every second, which is approximately 7.6% of the peak 
performance of GTX 480 (see Figure [7]). 
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Figure 7: Speedup (left) of GMG on GPU compared with its CPU version and performance (right) 
of GMG on GPU in 3D 



Table 6: Wall time, GFLOPs (double precision), and speedup for computing the residual in 3D 



L 


CPU 


GPU 


Speedup 


time (s) 


GFLOPs 


time (s) 


GFLOPs 


5 


1.2181e-04 


2.36 


2.4494e-05 


11.79 


4.97 


6 


1.0592e-03 


2.07 


8.9253e-05 


24.58 


11.87 


7 


9.6409e-03 


1.78 


5.2051e-04 


32.99 


18.52 


8 


1.3772e-01 


0.99 


5.1200e-03 


26.53 


26.90 



5.3 Performance of GMG on GPUs 

In this subsection, we would like to understand more about which part(s) of the GPU implementa- 
tion are the bottleneck(s). Tables [7] and [8] show the kernel time and communication time in 2D and 
3D, respectively. (In these two tables, Total = Kernel time + Communication time.) The numerical 
results show that the kernel computation takes 75% to 90% of the total wall time. Furthermore, 
the portion of communication time decreases as problem size increases if we consider problems with 
a large degree of freedom. 



Table 7: Kernel time (seconds) and communication time (seconds) in 2D 



L{L g = L) 


8 


9 


10 


11 


12 


Kernel time 


5.934e-3 


1.064e-2 


3.134e-2 


1.124e-l 


4.339e-l 


Communication time 


9.070e-4 


3.491e-3 


5.884e-3 


1.522e-2 


5.169e-2 


Communication /Total 


13.26% 


24.70% 


15.81% 


11.93% 


10.64% 
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Table 8: Kernel time (seconds) and communication time (seconds) in 3D 



L(Lg = L) 


5 


6 


7 


8 


Kernel time 


8.544e-3 


2.404e-2 


9.997e-2 


7.287e-l 


Communication time 


4.938e-4 


3.716e-3 


9.574e-3 


5.141e-2 


Communication /Total 


5.46% 


13.39% 


8.74% 


6.59% 



Tables [9] and [10] show the wall time (ratio to the total kernel time) for each function of one 
V-cycle in 2D and 3D, respectively. Tables [11] and [12] show the GFLOPs for each function of 
one V-cycle in 2D and 3D, respectively. The numerical results show that the multicolored Gauss- 
Seidel smoother (GSforward and GSbackward) counts for more than 50% of the total kernel time. 
Furthermore, because we are using the multicolored GS smoother, we need to launch the GS kernel 
several times (equal to the number of colors) — this introduces larger overhead. On the other hand, 
as we noted, using the weighted Jacobi method does not help, although it could result in better 
parallelism. Another observation is that computing the Euclidean norm of the residual gets very 
low efficiency due to its high memory-access/computation rate. 



Table 9: Wall time ratio in the kernel time of each subroutine in one V-cycle in 2D 



L(Lg = L) 


8 9 10 11 12 


Residual 


16.41% 16.35% 19.64% 21.64% 22.11% 


GSforward 


29.42% 29.39% 25.74% 25.57% 25.55% 


GSbackward 


23.29% 24.55% 24.97% 24.61% 24.85% 


Restriction 


14.82% 14.25% 8.90% 6.53% 5.99% 


Prolongation 


11.01% 10.47% 10.97% 10.30% 9.92% 


Norm 


5.06% 4.99% 9.78% 11.35% 11.59% 



Table 10: Wall time ratio in the kernel time of each subroutine in one V-cycle in 3D 



L(L e = L) 


5 6 7 8 


Residual 


12.76% 13.90% 19.07% 25.34% 


GSforward 


33.00% 30.59% 27.75% 27.21% 


GSbackward 


26.71% 27.76% 26.75% 26.78% 


Restriction 


10.03% 7.82% 5.69% 4.38% 


Prolongation 


13.67% 16.96% 16.57% 15.69% 


Norm 


3.82% 2.97% 4.16% 2.79% 
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Table 11: Performance (GFLOPs) of each subroutine in one V-cycle in 2D 



T ( T T \ 

L(L e = L) 


t~\ J~\ -i /~\ -1-1 -1 f A 

8 9 10 11 12 


Residual 


6.46 8.75 21.23 26.01 27.12 


GSforward 


1.71 2.32 7.70 10.46 11.14 


GSbackward 


2.15 2.77 7.94 10.91 11.53 


Restriction 


1.37 1.92 8.95 16.44 19.09 


Prolongation 


2.57 3.62 10.01 14.33 15.87 


Norm 


3.04 4.14 6.11 7.13 7.39 



Table 12: Performance (GFLOPs) of each subroutine in one V-cycle in 3D 



L(L = L) 


5 6 7 8 


Residual 


6.64 19.24 28.96 24.81 


GSforward 


1.20 4.09 9.31 10.81 


GSbackward 


1.49 4.51 9.64 10.98 


Restriction 


2.58 10.49 15.46 25.78 


Prolongation 


1.25 4.82 13.32 19.43 


Norm 


1.20 3.04 6.41 7.66 



Tables [T31 and [T41 show that the memory complexity of Algorithm [3] is O(N) in both the 2D and 
3D cases. Furthermore, the constant in O(N) is also very small — less than 4. 

Table 13: GPU device memory usage (double precision floating-point numbers) for 2D GMG 



L 


8 


9 


10 


11 


12 


N 


66049 


263169 


1050625 


4198401 


16785409 


Memory usage 


242865 


966323 


3855029 


15399607 


61557433 


Memory usage/ N 


3.6770 


3.6719 


3.6693 


3.6680 


3.6673 



Table 14: GPU device memory usage (double precision floating-point numbers) for 3D GMG 



L 


5 


6 


7 


8 


N 


35937 


274625 


2146689 


16974593 


Memory usage 


119399 


907337 


7072779 


55849869 


Memory usage/iV 


3.3225 


3.3039 


3.2947 


3.2902 



5.4 FMG vs. FFT 

Now we compare the FFT method with the geometric multigrid method as fast Poisson solution 
methods. FFT is a direct solver, and multigrid is an iterative solver. Therefore, making a fair 
comparison between the two is not an easy task. We set up our comparison in the following 
way: we tested a sequence of FMG methods, each of which had a different number of pre- and 
post-relaxation sweeps. Then we compared FFT with the most efficient FMG scheme in order 
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to determine which gives the optimal approximation error. As FFT and FMG require the same 
amount of data to be transmitted, we only compare the respective kernel times here. 

We consider cases with 16 million unknowns in 2D (L = 12) and 3D (L = 8). For the 2D case, 
from Tables [TBI and [T6l we notice that FMG(1,2) is enough to guarantee the optimal convergence 
of the approximation error in L 2 (Cl). On the other hand, for the 3D case, we need to use at least 
FMG(3,3) in order to obtain the optimal convergence rate (see Tables [T71 and fT8l) . Moreover, the 
optimal FMG is 33% and 23% faster than FFT in 2D and 3D, respectively. 



Table 15: Approximation error \\u — Uh\\ in 2D 



L 


FFT 


FMG(1,1) 


FMG(1,2) 


FMG(2,2) 


FMG(2,3) 


FMG(3,3) 


9 


1.563e-6 


1.001e-5 


1.242e-6 


1.004e-6 


7.028e-7 


7.145e-7 


10 


3.914e-7 


2.618e-6 


3.113e-7 


2.518e-7 


1.762e-7 


1.790e-7 


11 


9.797e-8 


6.766e-7 


7.791e-8 


6.304e-8 


4.411e-8 


4.479e-8 


12 


2.450e-8 


1.735e-7 


1.948e-8 


1.577e-8 


1.103e-8 


1.120e-8 




Table 16: Kernel time (seconds) of FFT and FMG in 


2D 


L 


FFT 


FMG(1,1) 


FMG(1,2) 


FMG(2,2) 


FMG(2,3) 


FMG(3,3) 


9 


3.739e-3 


3.611e-3 


4.260e-3 


4.980e-3 


5.617e-3 


6.348e-3 


10 


1.102e-2 


7.434e-3 


8.770e-3 


1.008e-2 


1.144e-2 


1.282e-2 


11 


4.077e-2 


2.203e-2 


2.571e-2 


2.945e-2 


3.317e-2 


3.701e-2 


12 


1.364e-l 


7.860e-2 


9.167e-2 


1.049e-l 


1.180e-l 


1.310e-l 






Table 17: Approximation 


error \\u — 


Uh\\2 in 3D 




L 


FFT 


FMG(1,1) 


FMG(1,2) 


FMG(2,2) 


FMG(2,3) 


FMG(3,3) 


5 


2.841e-4 


6.509e-3 


2.733e-3 


1.246e-3 


7.873e-4 


5.296e-4 


6 


7.100e-5 


2.685e-3 


9.469e-4 


3.930e-4 


2.426e-4 


1.608e-4 


7 


1.774e-5 


1.032e-3 


2.988e-4 


1.125e-4 


6.751e-5 


4.394e-5 


8 


4.437e-6 


3.803e-4 


8.880e-5 


3.049e-5 


1.784e-5 


1.145e-5 




Table 18: Kernel time (seconds) of FFT and FMG in 


3D 


L 


FFT 


FMG(1,1) 


FMG(1,2) 


FMG(2,2) 


FMG(2,3) 


FMG(3,3) 


5 


5.102e-4 


1.611e-3 


1.932e-3 


2.382e-3 


2.738e-3 


3.186e-3 


6 


1.890e-3 


3.711e-3 


4.474e-3 


5.335e-3 


6.098e-3 


6.986e-3 


7 


5.884e-2 


1.342e-2 


1.586e-2 


1.846e-2 


2.094e-2 


2.352e-2 


8 


1.893e-l 


8.566e-2 


1.007e-l 


1.155e-l 


1.302e-l 


1.456e-l 



6 Conclusion 

In this work, we studied the performance of GMG on CPU-GPU heterogenous computers. Our 
numerical results suggest that in the best-case scenario the GPU version of GMG can achieve 18.5 
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times speed-up in 2D and 16.0 times speed-up in 3D compared with an efficient implementation of 
multigrid methods on CPUs. When the problem is relatively small we found that the heterogenous 
algorithm (0 < Lq < L) usually gives the best computational performance. On the other hand, 
when the problem size is large enough, then we concluded that it is generally preferable to do the 
computation on GPUs. We observed the smoothing and computing norm of the residual account for 
the low floating-point performance and low efficiency of GMG on GPUs. Furthermore, we compared 
our method with the Fast Fourier Transform in the state-of-the-art cuFFT library. For the test 
cases with 16 million unknowns (L = 12 in 2D and L = 8 in 3D), we showed that the optimal FMG 
method is 33% and 23% faster than FFT in 2D and 3D, respectively. Of at least equal importance 
is that GPU is more cost-effective (in terms of initial cost and daily energy consumption) than 
modern multicore CPUs for geometric multigrid methods. 
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Nomenclature 



Name 


type 


size 


Brief Description 


d 


int 


scalar 


Spatial dimension: 2 for 2D and 3 for 3D 


L 


int 


scalar 


Total number of grid levels 


U 


int 


scalar 


Critical grid level between CPU and GPU computing 


hi 


double 


scalar 


Grid size of the Z-th level hi — 


ni 


int 


scalar 


Grid size of each direction of the Z-th level ni = T — hi 


N 


int 


scalar 


Number of unknowns N = (no) 




double 


scalar 


Number of forward relaxation sweeps 




double 


scalar 


Number of backward relaxation sweeps 


A ; 


null 




Grid operator of No. 1 level 




null 




Restriction operator from the Z-th level to the (Z + l)-th level 


pi+i 


null 




Prolongation operator from the (Z + l)-th level to the Z-th level 


Ul 


double* 


(2 L - 1 + l) d 


Solution vector of the Z-th level 


h 


double* 


(2 L - 1 + l) d 


Right-side vector of the Z-th level 


n 


double* 


(2 L - 1 + l) d 


Residual of the Z-th level fi — fi — A ; it; 


tols 

INI 


double 


scalar 


Stopping criterion ^j^r^ < tol 
Shortened symbol for the 2-norm or | • | £ 2 
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